{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Formulas: Fitting models using R-style formulas"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Since version 0.5.0, ``statsmodels`` allows users to fit statistical models using R-style formulas. Internally, ``statsmodels`` uses the [patsy](http://patsy.readthedocs.org/) package to convert formulas and data to the matrices that are used in model fitting. The formula framework is quite powerful; this tutorial only scratches the surface. A full description of the formula language can be found in the ``patsy`` docs: \n",
    "\n",
    "* [Patsy formula language description](http://patsy.readthedocs.org/)\n",
    "\n",
    "## Loading modules and functions"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "import numpy as np  # noqa:F401  needed in namespace for patsy\n",
    "import statsmodels.api as sm"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Import convention"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can import explicitly from statsmodels.formula.api"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from statsmodels.formula.api import ols"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Alternatively, you can just use the `formula` namespace of the main `statsmodels.api`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<bound method Model.from_formula of <class 'statsmodels.regression.linear_model.OLS'>>"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "sm.formula.ols"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Or you can use the following convention"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "import statsmodels.formula.api as smf"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "These names are just a convenient way to get access to each model's `from_formula` classmethod. See, for instance"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<bound method Model.from_formula of <class 'statsmodels.regression.linear_model.OLS'>>"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "sm.OLS.from_formula"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "All of the lower case models accept ``formula`` and ``data`` arguments, whereas upper case ones take ``endog`` and ``exog`` design matrices. ``formula`` accepts a string which describes the model in terms of a ``patsy`` formula. ``data`` takes a [pandas](https://pandas.pydata.org/) data frame or any other data structure that defines a ``__getitem__`` for variable names like a structured array or a dictionary of variables. \n",
    "\n",
    "``dir(sm.formula)`` will print a list of available models. \n",
    "\n",
    "Formula-compatible models have the following generic call signature: ``(formula, data, subset=None, *args, **kwargs)``"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "## OLS regression using formulas\n",
    "\n",
    "To begin, we fit the linear model described on the [Getting Started](./regression_diagnostics.html) page. Download the data, subset columns, and list-wise delete to remove missing observations:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "dta = sm.datasets.get_rdataset(\"Guerry\", \"HistData\", cache=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Lottery</th>\n",
       "      <th>Literacy</th>\n",
       "      <th>Wealth</th>\n",
       "      <th>Region</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>41</td>\n",
       "      <td>37</td>\n",
       "      <td>73</td>\n",
       "      <td>E</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>38</td>\n",
       "      <td>51</td>\n",
       "      <td>22</td>\n",
       "      <td>N</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>66</td>\n",
       "      <td>13</td>\n",
       "      <td>61</td>\n",
       "      <td>C</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>80</td>\n",
       "      <td>46</td>\n",
       "      <td>76</td>\n",
       "      <td>E</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>79</td>\n",
       "      <td>69</td>\n",
       "      <td>83</td>\n",
       "      <td>E</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   Lottery  Literacy  Wealth Region\n",
       "0       41        37      73      E\n",
       "1       38        51      22      N\n",
       "2       66        13      61      C\n",
       "3       80        46      76      E\n",
       "4       79        69      83      E"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df = dta.data[['Lottery', 'Literacy', 'Wealth', 'Region']].dropna()\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Fit the model:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "                            OLS Regression Results                            \n",
      "==============================================================================\n",
      "Dep. Variable:                Lottery   R-squared:                       0.338\n",
      "Model:                            OLS   Adj. R-squared:                  0.287\n",
      "Method:                 Least Squares   F-statistic:                     6.636\n",
      "Date:                Tue, 24 Dec 2019   Prob (F-statistic):           1.07e-05\n",
      "Time:                        14:50:45   Log-Likelihood:                -375.30\n",
      "No. Observations:                  85   AIC:                             764.6\n",
      "Df Residuals:                      78   BIC:                             781.7\n",
      "Df Model:                           6                                         \n",
      "Covariance Type:            nonrobust                                         \n",
      "===============================================================================\n",
      "                  coef    std err          t      P>|t|      [0.025      0.975]\n",
      "-------------------------------------------------------------------------------\n",
      "Intercept      38.6517      9.456      4.087      0.000      19.826      57.478\n",
      "Region[T.E]   -15.4278      9.727     -1.586      0.117     -34.793       3.938\n",
      "Region[T.N]   -10.0170      9.260     -1.082      0.283     -28.453       8.419\n",
      "Region[T.S]    -4.5483      7.279     -0.625      0.534     -19.039       9.943\n",
      "Region[T.W]   -10.0913      7.196     -1.402      0.165     -24.418       4.235\n",
      "Literacy       -0.1858      0.210     -0.886      0.378      -0.603       0.232\n",
      "Wealth          0.4515      0.103      4.390      0.000       0.247       0.656\n",
      "==============================================================================\n",
      "Omnibus:                        3.049   Durbin-Watson:                   1.785\n",
      "Prob(Omnibus):                  0.218   Jarque-Bera (JB):                2.694\n",
      "Skew:                          -0.340   Prob(JB):                        0.260\n",
      "Kurtosis:                       2.454   Cond. No.                         371.\n",
      "==============================================================================\n",
      "\n",
      "Warnings:\n",
      "[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.\n"
     ]
    }
   ],
   "source": [
    "mod = ols(formula='Lottery ~ Literacy + Wealth + Region', data=df)\n",
    "res = mod.fit()\n",
    "print(res.summary())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Categorical variables\n",
    "\n",
    "Looking at the summary printed above, notice that ``patsy`` determined that elements of *Region* were text strings, so it treated *Region* as a categorical variable. `patsy`'s default is also to include an intercept, so we automatically dropped one of the *Region* categories.\n",
    "\n",
    "If *Region* had been an integer variable that we wanted to treat explicitly as categorical, we could have done so by using the ``C()`` operator: "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Intercept         38.651655\n",
      "C(Region)[T.E]   -15.427785\n",
      "C(Region)[T.N]   -10.016961\n",
      "C(Region)[T.S]    -4.548257\n",
      "C(Region)[T.W]   -10.091276\n",
      "Literacy          -0.185819\n",
      "Wealth             0.451475\n",
      "dtype: float64\n"
     ]
    }
   ],
   "source": [
    "res = ols(formula='Lottery ~ Literacy + Wealth + C(Region)', data=df).fit()\n",
    "print(res.params)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Patsy's mode advanced features for categorical variables are discussed in: [Patsy: Contrast Coding Systems for categorical variables](./contrasts.html)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Operators\n",
    "\n",
    "We have already seen that \"~\" separates the left-hand side of the model from the right-hand side, and that \"+\" adds new columns to the design matrix. \n",
    "\n",
    "## Removing variables\n",
    "\n",
    "The \"-\" sign can be used to remove columns/variables. For instance, we can remove the intercept from a model by: "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "C(Region)[C]    38.651655\n",
      "C(Region)[E]    23.223870\n",
      "C(Region)[N]    28.634694\n",
      "C(Region)[S]    34.103399\n",
      "C(Region)[W]    28.560379\n",
      "Literacy        -0.185819\n",
      "Wealth           0.451475\n",
      "dtype: float64\n"
     ]
    }
   ],
   "source": [
    "res = ols(formula='Lottery ~ Literacy + Wealth + C(Region) -1 ', data=df).fit()\n",
    "print(res.params)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Multiplicative interactions\n",
    "\n",
    "\":\" adds a new column to the design matrix with the interaction of the other two columns. \"*\" will also include the individual columns that were multiplied together:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Literacy:Wealth    0.018176\n",
      "dtype: float64 \n",
      "\n",
      "Literacy           0.427386\n",
      "Wealth             1.080987\n",
      "Literacy:Wealth   -0.013609\n",
      "dtype: float64\n"
     ]
    }
   ],
   "source": [
    "res1 = ols(formula='Lottery ~ Literacy : Wealth - 1', data=df).fit()\n",
    "res2 = ols(formula='Lottery ~ Literacy * Wealth - 1', data=df).fit()\n",
    "print(res1.params, '\\n')\n",
    "print(res2.params)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Many other things are possible with operators. Please consult the [patsy docs](https://patsy.readthedocs.org/en/latest/formulas.html) to learn more."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Functions\n",
    "\n",
    "You can apply vectorized functions to the variables in your model: "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Intercept           115.609119\n",
      "np.log(Literacy)    -20.393959\n",
      "dtype: float64\n"
     ]
    }
   ],
   "source": [
    "res = smf.ols(formula='Lottery ~ np.log(Literacy)', data=df).fit()\n",
    "print(res.params)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Define a custom function:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Intercept               136.003079\n",
      "log_plus_1(Literacy)    -20.393959\n",
      "dtype: float64\n"
     ]
    }
   ],
   "source": [
    "def log_plus_1(x):\n",
    "    return np.log(x) + 1.\n",
    "res = smf.ols(formula='Lottery ~ log_plus_1(Literacy)', data=df).fit()\n",
    "print(res.params)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Any function that is in the calling namespace is available to the formula."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Using formulas with models that do not (yet) support them\n",
    "\n",
    "Even if a given `statsmodels` function does not support formulas, you can still use `patsy`'s formula language to produce design matrices. Those matrices \n",
    "can then be fed to the fitting function as `endog` and `exog` arguments. \n",
    "\n",
    "To generate ``numpy`` arrays: "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[[41.]\n",
      " [38.]\n",
      " [66.]\n",
      " [80.]\n",
      " [79.]]\n",
      "[[1.000e+00 3.700e+01 7.300e+01 2.701e+03]\n",
      " [1.000e+00 5.100e+01 2.200e+01 1.122e+03]\n",
      " [1.000e+00 1.300e+01 6.100e+01 7.930e+02]\n",
      " [1.000e+00 4.600e+01 7.600e+01 3.496e+03]\n",
      " [1.000e+00 6.900e+01 8.300e+01 5.727e+03]]\n"
     ]
    }
   ],
   "source": [
    "import patsy\n",
    "f = 'Lottery ~ Literacy * Wealth'\n",
    "y,X = patsy.dmatrices(f, df, return_type='matrix')\n",
    "print(y[:5])\n",
    "print(X[:5])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To generate pandas data frames: "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "   Lottery\n",
      "0     41.0\n",
      "1     38.0\n",
      "2     66.0\n",
      "3     80.0\n",
      "4     79.0\n",
      "   Intercept  Literacy  Wealth  Literacy:Wealth\n",
      "0        1.0      37.0    73.0           2701.0\n",
      "1        1.0      51.0    22.0           1122.0\n",
      "2        1.0      13.0    61.0            793.0\n",
      "3        1.0      46.0    76.0           3496.0\n",
      "4        1.0      69.0    83.0           5727.0\n"
     ]
    }
   ],
   "source": [
    "f = 'Lottery ~ Literacy * Wealth'\n",
    "y,X = patsy.dmatrices(f, df, return_type='dataframe')\n",
    "print(y[:5])\n",
    "print(X[:5])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "                            OLS Regression Results                            \n",
      "==============================================================================\n",
      "Dep. Variable:                Lottery   R-squared:                       0.309\n",
      "Model:                            OLS   Adj. R-squared:                  0.283\n",
      "Method:                 Least Squares   F-statistic:                     12.06\n",
      "Date:                Tue, 24 Dec 2019   Prob (F-statistic):           1.32e-06\n",
      "Time:                        14:51:30   Log-Likelihood:                -377.13\n",
      "No. Observations:                  85   AIC:                             762.3\n",
      "Df Residuals:                      81   BIC:                             772.0\n",
      "Df Model:                           3                                         \n",
      "Covariance Type:            nonrobust                                         \n",
      "===================================================================================\n",
      "                      coef    std err          t      P>|t|      [0.025      0.975]\n",
      "-----------------------------------------------------------------------------------\n",
      "Intercept          38.6348     15.825      2.441      0.017       7.149      70.121\n",
      "Literacy           -0.3522      0.334     -1.056      0.294      -1.016       0.312\n",
      "Wealth              0.4364      0.283      1.544      0.126      -0.126       0.999\n",
      "Literacy:Wealth    -0.0005      0.006     -0.085      0.933      -0.013       0.012\n",
      "==============================================================================\n",
      "Omnibus:                        4.447   Durbin-Watson:                   1.953\n",
      "Prob(Omnibus):                  0.108   Jarque-Bera (JB):                3.228\n",
      "Skew:                          -0.332   Prob(JB):                        0.199\n",
      "Kurtosis:                       2.314   Cond. No.                     1.40e+04\n",
      "==============================================================================\n",
      "\n",
      "Warnings:\n",
      "[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.\n",
      "[2] The condition number is large, 1.4e+04. This might indicate that there are\n",
      "strong multicollinearity or other numerical problems.\n"
     ]
    }
   ],
   "source": [
    "print(sm.OLS(y, X).fit().summary())"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.1"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}
